Exoplanet Modeling
The figure below showcases a spectrum integral to exoplanet detection, charting the transit depth against wavelength in microns. Every observation point in the spectrum carries an inherent uncertainty, denoted by the vertical error bars. To decode and potentially reduce this uncertainty, it is important to understand how features such as the planet radius, planet temperature, and the logarithmic concentrations of H₂O, CO₂, CO, CH₄, and NH₃ influence the transit depth. Interpretative tools like SHAP can provide insight into how these exoplanetary features impact the observed transit depth, refining our understanding and accuracy in exoplanet spectral analysis.
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
Image(filename='my_spectrum.png')
# Load the CSV file into a Pandas DataFrame
df = pd.read_csv("small_astro.csv")
df_backup = df.copy()
df_backup.head(10)
| | Unnamed: 0 | planet_radius | planet_temp | log_h2o | log_co2 | log_co | log_ch4 | log_nh3 | x1 | x2 | … | x43 | x44 | x45 | x46 | x47 | x48 | x49 | x50 | x51 | x52 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.559620 | 863.394770 | -8.865868 | -6.700707 | -5.557561 | -8.957615 | -3.097540 | 0.003836 | 0.003834 | … | 0.003938 | 0.003941 | 0.003903 | 0.003931 | 0.003983 | 0.004019 | 0.004046 | 0.004072 | 0.004054 | 0.004056 |
1 | 2 | 1.118308 | 1201.700465 | -4.510258 | -8.228966 | -3.565427 | -7.807424 | -3.633658 | 0.015389 | 0.015148 | … | 0.015450 | 0.015447 | 0.015461 | 0.015765 | 0.016099 | 0.016376 | 0.016549 | 0.016838 | 0.016781 | 0.016894 |
2 | 3 | 0.400881 | 1556.096477 | -7.225472 | -6.931472 | -3.081975 | -8.567854 | -5.378472 | 0.002089 | 0.002073 | … | 0.001989 | 0.002168 | 0.002176 | 0.002123 | 0.002079 | 0.002081 | 0.002106 | 0.002167 | 0.002149 | 0.002185 |
3 | 4 | 0.345974 | 1268.624884 | -7.461157 | -5.853334 | -3.044711 | -5.149378 | -3.815568 | 0.002523 | 0.002392 | … | 0.002745 | 0.003947 | 0.004296 | 0.003528 | 0.003352 | 0.003629 | 0.003929 | 0.004363 | 0.004216 | 0.004442 |
4 | 5 | 0.733184 | 1707.323564 | -4.140844 | -7.460278 | -3.181793 | -5.996593 | -4.535345 | 0.002957 | 0.002924 | … | 0.003402 | 0.003575 | 0.003667 | 0.003740 | 0.003823 | 0.003904 | 0.003897 | 0.004004 | 0.004111 | 0.004121 |
5 | 6 | 0.161165 | 620.185809 | -4.875000 | -5.074766 | -3.861240 | -5.388011 | -8.390503 | 0.000444 | 0.000442 | … | 0.000432 | 0.000486 | 0.000473 | 0.000462 | 0.000447 | 0.000455 | 0.000455 | 0.000457 | 0.000463 | 0.000474 |
6 | 7 | 0.194312 | 900.597575 | -8.299899 | -6.850709 | -4.314491 | -3.712038 | -3.951455 | 0.001794 | 0.001721 | … | 0.001048 | 0.001052 | 0.000948 | 0.000976 | 0.001122 | 0.001274 | 0.001395 | 0.001522 | 0.001456 | 0.001823 |
7 | 8 | 1.132685 | 1176.443900 | -6.765865 | -7.398548 | -3.378307 | -3.763737 | -5.881384 | 0.012950 | 0.012946 | … | 0.014019 | 0.013871 | 0.013810 | 0.013902 | 0.014024 | 0.014150 | 0.014298 | 0.014392 | 0.014401 | 0.015042 |
8 | 9 | 0.158621 | 1189.209841 | -8.376041 | -6.321977 | -3.243900 | -8.711851 | -3.449195 | 0.000444 | 0.000445 | … | 0.000562 | 0.000595 | 0.000571 | 0.000590 | 0.000628 | 0.000663 | 0.000692 | 0.000734 | 0.000718 | 0.000736 |
9 | 10 | 0.660642 | 528.023669 | -3.804286 | -8.919378 | -4.686964 | -8.150277 | -3.068319 | 0.008997 | 0.009035 | … | 0.009435 | 0.009375 | 0.009315 | 0.009357 | 0.009563 | 0.009739 | 0.009821 | 0.009890 | 0.009819 | 0.009734 |
10 rows × 60 columns
# Columns of interest for planetary and chemical properties and transit depth
feature_columns = df.columns[1:8]
transit_depth_column = 'x1'  # pick the first wavelength

# Calculate mean and variance for features and transit depth
feature_mean = df[feature_columns].mean()
feature_variance = df[feature_columns].var()
transit_depth_mean = df[transit_depth_column].mean()
transit_depth_variance = df[transit_depth_column].var()

# Visualize the distributions
fig, axes = plt.subplots(1, 8, figsize=(18, 4))
for i, col in enumerate(feature_columns):
    sns.histplot(df[col], bins=20, kde=True, ax=axes[i])
    axes[i].set_title(f'{col}\nMean: {feature_mean[col]:.2f}\nVariance: {feature_variance[col]:.2f}')

# Add visualization for transit depth
sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[-1])
axes[-1].set_title(f'{transit_depth_column}\nMean: {transit_depth_mean:.2f}\nVariance: {transit_depth_variance:.2f}')
plt.tight_layout()
plt.show()
feature_mean, feature_variance, transit_depth_mean, transit_depth_variance
(planet_radius 0.703714
planet_temp 1073.229674
log_h2o -5.934889
log_co2 -6.873009
log_co -4.497141
log_ch4 -5.799850
log_nh3 -6.051791
dtype: float64,
planet_radius 0.198990
planet_temp 154495.743225
log_h2o 3.130868
log_co2 1.649658
log_co 0.738255
log_ch4 3.208283
log_nh3 3.050545
dtype: float64,
0.009442470522591322,
0.00016172106267489707)
Exoplanet Feature Distributions
The provided visualizations and statistics shed light on the distribution of various exoplanetary features and the observed transit depth at the x1 wavelength.
Planet Radius: This feature, with a mean value of approximately 0.7037 and variance of 0.1989, mostly lies between 0.5 and 1.5 as depicted in its histogram.
Planet Temperature: Exhibiting a wider spread, the temperature has a mean of approximately 1073.23 K and a variance of 154495.74 K².
Logarithmic Concentrations:
- H₂O: Mean concentration of -5.93 with a variance of 3.13.
- CO₂: Mean concentration of -6.87 with a variance of 1.65.
- CO: Mean concentration of -4.50 with a variance of 0.74.
- CH₄: Mean concentration of -5.80 with a variance of 3.21.
- NH₃: Mean concentration of -6.05 with a variance of 3.05.
Transit Depth at x1 Wavelength: This depth, crucial for exoplanet detection, is tightly clustered near zero, with a mean of approximately 0.0094 and a small variance of about 0.00016.
These distributions and their accompanying statistics offer invaluable insights into the data’s nature and its inherent variability, essential for accurate spectral analysis and interpretation.
# Import required libraries for modeling and SHAP values
import xgboost as xgb
import shap
# Prepare the feature matrix (X) and the target vector (y)
X = df.iloc[:, 1:8]
y = df['x1']

# Train an XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror')
model.fit(X, y)

# Initialize the SHAP explainer
explainer = shap.Explainer(model)

# Calculate SHAP values
shap_values = explainer(X)

# Summary plot for SHAP values
shap.summary_plot(shap_values, X, title='Shapley Feature Importance')
SHAP (SHapley Additive exPlanations) values stem from cooperative game theory and provide a way to interpret machine learning model predictions. Each feature in a model is analogous to a player in a game, and the contribution of each feature to a prediction is like the payout a player receives in the game. In SHAP, we calculate the value of each feature by considering all possible combinations (coalitions) of features, assessing the change in prediction with and without that feature. The resulting value, the Shapley value, represents the average contribution of a feature to all possible predictions.
The summary plot visualizes these values. Each dot represents a SHAP value for a specific instance of a feature; positive values (right of the centerline) indicate that a feature increases the model’s output, while negative values (left of the centerline) suggest a decrease. The color depicts the actual value of the feature for the given instance, enabling a comprehensive view of feature influence across the dataset.
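To make the coalition idea concrete, here is a minimal sketch that brute-forces exact Shapley values for a single observation by enumerating every feature coalition and replacing "absent" features with dataset-mean baseline values. The helper `brute_force_shapley` and the mean baseline are illustrative assumptions; in practice, `shap.Explainer` applies the much faster TreeSHAP algorithm to the XGBoost model trained above rather than this enumeration.

```python
from itertools import combinations
from math import factorial
import numpy as np

def brute_force_shapley(predict, x_row, baseline):
    """Exact Shapley values for one instance by enumerating all feature coalitions.

    Features absent from a coalition are replaced by baseline values (a common
    simplification; TreeSHAP conditions on the tree structure instead)."""
    n = len(x_row)
    phi = np.zeros(n)

    def value(coalition):
        # Build an input where only the coalition's features keep their true values
        z = baseline.copy()
        idx = list(coalition)
        z[idx] = x_row[idx]
        return predict(z.reshape(1, -1))[0]

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(S + (i,)) - value(S))
    return phi

# Example: Shapley values for the first row, using the XGBoost model trained above
phi = brute_force_shapley(model.predict, X.iloc[0].to_numpy(), X.mean().to_numpy())
print(dict(zip(X.columns, np.round(phi, 5))))
```

With only seven features there are 2⁷ coalitions per row, so the enumeration stays cheap; the cost grows exponentially with feature count, which is why approximate algorithms such as TreeSHAP are used in practice.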
# Import the Random Forest Regressor and visualization libraries
from sklearn.ensemble import RandomForestRegressor
# Train a Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)

# Extract feature importances
feature_importances = rf_model.feature_importances_

# Create a DataFrame for visualization
importance_df = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=True)

# Create a horizontal bar chart for feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='dodgerblue')
plt.xlabel('Importance')
plt.ylabel('')
plt.title('Random Forest Feature Importance')
plt.xticks(fontsize=13)  # Increase the font size of the x-axis tick labels
plt.yticks(fontsize=14)  # Increase the font size of the y-axis tick labels

# Save the figure before showing it
plt.savefig('random_forest_importance_plot.png', bbox_inches='tight', dpi=300)  # 'bbox_inches' ensures the entire plot is saved
plt.show()
feature_importances
array([0.71850575, 0.10367982, 0.01655567, 0.07232808, 0.05603188,
0.0146824 , 0.0182164 ])
Random Forest Feature Importance Analysis
To gain insights into which exoplanetary features most influence the observed transit depth, a Random Forest Regressor was utilized. Here’s an outline of the procedure:
- Model Initialization: A Random Forest Regressor model was instantiated with 100 trees and a fixed random seed of 42 for reproducibility.
- Model Training: The model was trained on the feature set `X` and target variable `y`.
- Feature Importance Extraction: After training, the importance of each feature was extracted using the `feature_importances_` attribute of the trained model.
- Data Preparation for Visualization: A DataFrame was created to house each feature alongside its respective importance. The features were then sorted in ascending order of importance for better visualization.
- Visualization: A horizontal bar chart was plotted to showcase the importance of each feature. The chart offers a clear visual comparison, with the y-axis representing the features and the x-axis indicating their importance. Special attention was paid to font size adjustments for better readability. Furthermore, before displaying the chart, it was saved as a high-resolution PNG image.
The resulting visualization, titled ‘Random Forest Feature Importance’, provides a clear understanding of the relative significance of each feature in predicting the transit depth, as discerned by the Random Forest model.
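Because the SHAP analysis and the Random Forest both produce per-feature importance measures, it can be useful to line them up numerically rather than compare the plots by eye. The following is a small illustrative sketch, assuming `shap_values`, `feature_importances`, and `feature_columns` from the cells above are still in scope; the name `comparison` and the ranking columns are additions for demonstration.

```python
import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature from the XGBoost explainer
mean_abs_shap = np.abs(shap_values.values).mean(axis=0)

# Put both importance measures side by side and rank them (1 = most important)
comparison = pd.DataFrame({
    'mean_abs_shap': mean_abs_shap,
    'rf_importance': feature_importances,
}, index=list(feature_columns))
comparison['shap_rank'] = comparison['mean_abs_shap'].rank(ascending=False).astype(int)
comparison['rf_rank'] = comparison['rf_importance'].rank(ascending=False).astype(int)

print(comparison.sort_values('rf_importance', ascending=False))
```

If the two rankings broadly agree, that is some reassurance that the importance ordering is not an artifact of a single model or attribution method.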
from IPython.display import Image
Image(filename='PME.png')
PME Feature Importance Analysis
Using a modeling technique, the impact of different exoplanetary features on the observed transit depth was assessed, and their importance was visualized in the attached figure titled ‘PME Feature Importance’. Here’s a breakdown of the visual representation:
- Most Influential Feature: The `planet_radius` stands out as the most influential feature, with the highest PME (proportional marginal effect) value. This suggests that the radius of the planet plays a pivotal role in determining the observed transit depth.
- Other Features: Logarithmic concentrations of gases, such as `log_co2`, `log_co`, `log_nh3`, `log_h2o`, and `log_ch4`, also exhibit varying degrees of importance. Among these, `log_co2` and `log_co` are the more significant contributors.
- Least Influential Feature: The `planet_temp`, representing the temperature of the planet, has the least importance in this analysis, suggesting a minimal role in influencing the transit depth, at least according to the PME metric.
- Visual Clarity: The horizontal bar chart offers a lucid comparison of feature importances. Each bar's length represents the PME value of a feature, providing a direct visual cue to its significance.
- Interpretation: This visualization aids in discerning which exoplanetary characteristics are most relevant when predicting the transit depth using the given model. It can guide future analyses by highlighting key features to focus on or, conversely, those that might be less consequential.
By examining the ‘PME Feature Importance’ chart, one gains a deeper understanding of the relative significance of each feature in predicting the transit depth within this specific modeling context.
Uncertainty Quantification and Feature Reduction in PME Feature Importance Analysis
When delving into uncertainty quantification and feature reduction in predictive modeling, it is crucial to evaluate feature importance metrics critically. The provided PME (Proportional Marginal Effects) Feature Importance Analysis offers a valuable lens for this task. Below is a nuanced exploration of its significance in the described contexts:
Uncertainty Quantification:
- Origin of Uncertainty:
- In exoplanetary spectral analysis, uncertainty can arise from observational noise, instrumental errors, or intrinsic variability of the observed phenomena. This uncertainty often manifests in the form of error bars in spectra, like the ones shown in transit depth against wavelengths.
- Feature Impact on Uncertainty:
- The degree of influence a feature has on the predicted outcome can be a proxy for how that feature might contribute to the overall predictive uncertainty. If a feature like `planet_radius` has a high PME value, it might be a critical determinant of transit depth. Any uncertainty in measuring or estimating the `planet_radius` could then propagate and significantly affect the prediction's reliability (a sketch of this propagation follows the list below).
- PME as a Measure of Stochastic Uncertainty:
- The PME values themselves might be obtained by analyzing a model’s sensitivity to perturbations in input features. A high PME value indicates that slight changes in the feature can lead to notable changes in the output, thereby implying a greater inherent stochastic uncertainty tied to that feature.
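As a concrete illustration of how measurement uncertainty in a dominant feature could propagate into the prediction, the sketch below jitters `planet_radius` with Gaussian noise and records the spread this induces in the trained Random Forest's predicted transit depth. The noise level `sigma_radius`, the draw count, and the helper name are arbitrary assumptions for demonstration; this Monte Carlo check illustrates the propagation idea only and is not the PME computation of Herin et al. (2022).

```python
import numpy as np

rng = np.random.default_rng(42)

def propagate_radius_uncertainty(model, X, sigma_radius=0.05, n_draws=200):
    """Monte Carlo propagation: jitter planet_radius and return the per-sample
    standard deviation of the predicted transit depth."""
    preds = np.empty((n_draws, len(X)))
    for d in range(n_draws):
        X_jittered = X.copy()
        X_jittered['planet_radius'] += rng.normal(0.0, sigma_radius, size=len(X))
        preds[d] = model.predict(X_jittered)
    return preds.std(axis=0)  # predictive spread attributable to radius noise alone

# Example: compare the induced spread to the typical transit depth scale
spread = propagate_radius_uncertainty(rf_model, X)
print(f"Median predictive std from radius noise: {np.median(spread):.5f} "
      f"(transit depth mean ~ {y.mean():.5f})")
```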
Feature Reduction:
- Identifying Critical Features:
- When dealing with a multitude of features, not all may be equally relevant. The PME analysis provides a hierarchy of feature importance. In this case, while `planet_radius` emerges as crucial, `planet_temp` appears less consequential. This differentiation is fundamental for feature reduction, guiding us on which features to prioritize in modeling.
- Reducing Dimensionality & Complexity:
- In data-driven modeling, especially with limited data points, overfitting is a genuine concern. By understanding which features significantly influence the predictions (like `planet_radius` or `log_co2`), one can potentially reduce the model's complexity and the risk of overfitting by focusing only on these paramount features.
- Informing Experimental Design:
- If further observational or experimental data is required, knowing feature importances can guide where resources are channeled. For instance, more precise measurements might be sought for features with high PME values, as their accurate estimation is vital for reliable predictions.
- Trade-off with Predictive Performance:
- It’s essential to understand that while feature reduction can simplify models and make them more interpretable, there’s always a trade-off with predictive performance. Removing features based on their PME values should be done judiciously, ensuring that the model’s predictive capability isn’t unduly compromised.
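One way to keep that trade-off honest is to compare cross-validated error for the full feature set against a reduced set of top-ranked features. The sketch below is a minimal illustration that reuses the Random Forest importances computed earlier; the cutoff of three features and the helper `cv_rmse` are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Rank features by the Random Forest importances computed earlier
ranked = [col for _, col in sorted(zip(feature_importances, feature_columns), reverse=True)]
top_features = ranked[:3]  # illustrative cutoff

def cv_rmse(features):
    """Cross-validated RMSE for a fresh Random Forest restricted to `features`."""
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X[features], y, cv=5,
                             scoring='neg_root_mean_squared_error')
    return -scores.mean()

print(f"CV RMSE, all 7 features: {cv_rmse(list(feature_columns)):.5f}")
print(f"CV RMSE, top 3 only {top_features}: {cv_rmse(top_features):.5f}")
```

If the reduced model's error is close to the full model's, the discarded features are probably safe to drop; a large gap would argue for keeping them despite their low importance scores.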
In summary, the ‘PME Feature Importance’ chart isn’t merely a representation of feature significance but serves as a cornerstone for rigorous analytical decisions in uncertainty quantification and feature reduction. Analyzing such importance metrics within the broader context of the problem at hand ensures that models are both robust and interpretable, catering effectively to the dual objectives of predictive accuracy and analytical clarity.
Weighted Principal Component Analysis (PCA) on Exoplanet Data
Conceptual Overview:
PCA is a method used to emphasize variation and capture strong patterns in a dataset. The “weighted PCA” approach fine-tunes this by considering the importance of different features, effectively giving more attention to features deemed vital.
Detailed Breakdown:
- Standardization:
- Before applying PCA, the dataset is standardized to give each feature a mean of 0 and a variance of 1. This is essential because PCA is influenced by the scale of the data.
- Weighted Features:
- Features are weighted according to their importance, as identified by the Random Forest model. Taking the square root of the weights ensures proper scaling during matrix multiplication in PCA.
- PCA Application:
- PCA projects the data into a new space defined by its principal components. The first two components often hold most of the dataset’s variance, making them crucial for visualization.
- Visualization:
- The scatter plot visualizes the data in the space of the first two principal components. The color indicates Transit Depth (x1), shedding light on how this parameter varies across the main patterns in the data.
- Explained Variance:
- This provides an understanding of how much original variance the first two components capture. A high percentage indicates that the PCA representation retains much of the data’s original structure.
Significance:
- Data Compression:
- PCA offers a simplified yet rich representation of the data, which can be invaluable for visualization and pattern recognition.
- Feature Emphasis:
- Using feature importances ensures the PCA representation highlights the most critical patterns related to influential features.
- Framework for Further Exploration:
- Observing patterns or groupings in the PCA plot can guide subsequent investigations, pinpointing areas of interest or potential clusters.
- Efficient Data Overview:
- The visualization provides a comprehensive but digestible overview of the data, suitable for a wide range of audiences.
In essence, weighted PCA melds the dimensionality reduction capabilities of PCA with the interpretative power of feature importances, offering a profound view into the dataset’s intricate structures and relationships.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Standardize the feature matrix
X_scaled = StandardScaler().fit_transform(X)

# Multiply each feature by its square root of importance weight for weighted PCA
# The square root is used because each feature contributes to both rows and columns in the dot product calculation
X_weighted = X_scaled * np.sqrt(feature_importances)

# Perform PCA on the weighted data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_weighted)

# Plotting the first two principal components
plt.figure(figsize=(10, 7))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=40)
plt.colorbar().set_label('Transit Depth (x1)', rotation=270, labelpad=15)
plt.xlabel('Weighted Principal Component 1')
plt.ylabel('Weighted Principal Component 2')
plt.title('Weighted PCA: First Two Principal Components')
plt.show()

# Variance explained by the first two principal components
variance_explained = pca.explained_variance_ratio_
variance_explained
array([0.71985328, 0.10370239])
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Data initialization is assumed from the earlier cells:
# X, y, feature_importances = ...
# Standardize the feature matrix
X_scaled = StandardScaler().fit_transform(X)

# Multiply each feature by its square root of importance weight for weighted PCA
X_weighted = X_scaled * np.sqrt(feature_importances)

# Perform PCA on the weighted data with three components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_weighted)

# Plotting the first three principal components
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis', edgecolor='k', s=40)
plt.colorbar(scatter, ax=ax, pad=0.2).set_label('Transit Depth (x1)', rotation=270, labelpad=15)
ax.set_xlabel('Weighted Principal Component 1')
ax.set_ylabel('Weighted Principal Component 2')
ax.set_zlabel('Weighted Principal Component 3')
plt.title('Weighted PCA: First Three Principal Components')
plt.show()

# Variance explained by the first three principal components
variance_explained = pca.explained_variance_ratio_
variance_explained
array([0.71985328, 0.10370239, 0.07160805])
# side by side
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Data initialization is assumed from the earlier cells:
# X, y, feature_importances = ...
# Standardize the feature matrix
X_scaled = StandardScaler().fit_transform(X)

# Multiply each feature by its square root of importance weight for weighted PCA
X_weighted = X_scaled * np.sqrt(feature_importances)

# Create a combined figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(20, 7))

# Perform PCA on the weighted data with two components and plot
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_weighted)
scatter_2d = axes[0].scatter(X_pca2[:, 0], X_pca2[:, 1], c=y, cmap='viridis', edgecolor='k', s=40)
fig.colorbar(scatter_2d, ax=axes[0], orientation='vertical').set_label('Transit Depth (x1)', rotation=270, labelpad=15)
axes[0].set_xlabel('Weighted Principal Component 1')
axes[0].set_ylabel('Weighted Principal Component 2')
axes[0].set_title('Weighted PCA: First Two Principal Components')

# Perform PCA on the weighted data with three components and plot
pca3 = PCA(n_components=3)
X_pca3 = pca3.fit_transform(X_weighted)
ax3d = fig.add_subplot(1, 2, 2, projection='3d')
scatter_3d = ax3d.scatter(X_pca3[:, 0], X_pca3[:, 1], X_pca3[:, 2], c=y, cmap='viridis', edgecolor='k', s=40)
fig.colorbar(scatter_3d, ax=ax3d, pad=0.2).set_label('Transit Depth (x1)', rotation=270, labelpad=15)
ax3d.set_xlabel('Weighted Principal Component 1')
ax3d.set_ylabel('Weighted Principal Component 2')
ax3d.set_zlabel('Weighted Principal Component 3')
ax3d.set_title('Weighted PCA: First Three Principal Components')

# Display the combined plot
plt.tight_layout()
plt.savefig('wPCA_2D_3D.png', bbox_inches='tight', dpi=300)  # 'bbox_inches' ensures the entire plot is saved
plt.show()
import pandas as pd
import seaborn as sns
# Data initialization is assumed from the earlier cells:
# df, feature_columns, transit_depth_column = ...
# Reconstruct the approximate original data using the first three principal components
reconstructed_data = np.dot(X_pca, pca.components_[:3, :]) + np.mean(X_scaled, axis=0)

# Create a DataFrame for the reconstructed data
reconstructed_df = pd.DataFrame(reconstructed_data, columns=feature_columns)

# Visualize the approximate original histograms
fig, axes = plt.subplots(1, len(feature_columns) + 1, figsize=(18, 4))  # Adjusted for the number of features
for i, col in enumerate(feature_columns):
    sns.histplot(reconstructed_df[col], bins=20, kde=True, ax=axes[i], color='orange')
    axes[i].set_title(f'Approx of {col}')

# Add visualization for actual transit depth (since it was not part of the PCA)
sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[-1], color='green')
axes[-1].set_title(f'Actual {transit_depth_column}')

plt.tight_layout()
plt.show()
fig, axes = plt.subplots(2, 8, figsize=(18, 8))

# Original Data
for i, col in enumerate(feature_columns):
    sns.histplot(df[col], bins=20, kde=True, ax=axes[0, i])
    axes[0, i].set_title(f'{col}\nMean: {feature_mean[col]:.2f}\nVariance: {feature_variance[col]:.2f}')

sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[0, -1])
axes[0, -1].set_title(f'{transit_depth_column}\nMean: {transit_depth_mean:.2f}\nVariance: {transit_depth_variance:.2f}')

# Reconstructed Data
for i, col in enumerate(feature_columns):
    sns.histplot(reconstructed_df[col], bins=20, kde=True, ax=axes[1, i], color='orange')
    axes[1, i].set_title(f'Surrogate of {col}')

sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[1, -1], color='green')
axes[1, -1].set_title(f'Actual {transit_depth_column}')

plt.tight_layout()
plt.show()
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Assuming data initialization and feature_importances, feature_columns, and transit_depth_column definitions exist
# Standardize the feature matrix
X_scaled = StandardScaler().fit_transform(X)

# Multiply each feature by its square root of importance weight for weighted PCA
X_weighted = X_scaled * np.sqrt(feature_importances)

# Initialize the figure and axes
fig, axes = plt.subplots(3, len(feature_columns) + 1, figsize=(18, 12))

# Original Data histograms
for i, col in enumerate(feature_columns):
    sns.histplot(df[col], bins=20, kde=True, ax=axes[0, i])
    axes[0, i].set_title(f'Original {col}')

sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[0, -1])
axes[0, -1].set_title(f'Original {transit_depth_column}')

# Reconstruction using 2 PCs
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_weighted)
reconstructed_data_2PCs = np.dot(X_pca2, pca2.components_[:2, :]) + np.mean(X_scaled, axis=0)
reconstructed_df_2PCs = pd.DataFrame(reconstructed_data_2PCs, columns=feature_columns)

for i, col in enumerate(feature_columns):
    sns.histplot(reconstructed_df_2PCs[col], bins=20, kde=True, ax=axes[1, i], color='orange')
    axes[1, i].set_title(f'2 PCs of {col}')

sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[1, -1], color='green')
axes[1, -1].set_title(f'Original {transit_depth_column}')

# Reconstruction using 3 PCs
pca3 = PCA(n_components=3)
X_pca3 = pca3.fit_transform(X_weighted)
reconstructed_data_3PCs = np.dot(X_pca3, pca3.components_[:3, :]) + np.mean(X_scaled, axis=0)
reconstructed_df_3PCs = pd.DataFrame(reconstructed_data_3PCs, columns=feature_columns)

for i, col in enumerate(feature_columns):
    sns.histplot(reconstructed_df_3PCs[col], bins=20, kde=True, ax=axes[2, i], color='purple')
    axes[2, i].set_title(f'3 PCs of {col}')

sns.histplot(df[transit_depth_column], bins=20, kde=True, ax=axes[2, -1], color='green')
axes[2, -1].set_title(f'Original {transit_depth_column}')

plt.tight_layout()
plt.savefig('wPCA_combined_plot.png', bbox_inches='tight', dpi=500)  # 'bbox_inches' ensures the entire plot is saved
plt.show()
Reconstruction of Exoplanet Data Using Principal Component Analysis (PCA)
The code segment aims to reconstruct exoplanet data using different numbers of principal components (PCs) from PCA, offering insights into how much information retention occurs as we vary the number of PCs.
Data Standardization:
- The features undergo standardization, ensuring each feature has a mean of 0 and variance of 1. This normalization is crucial for PCA, as the algorithm is sensitive to varying scales across features.
Weighted Features:
- Features are adjusted based on their significance as ascertained by a prior model, specifically the Random Forest. Weighting the features adjusts the emphasis the PCA places on each feature during the dimensionality reduction process.
Data Reconstruction:
- After performing PCA, the algorithm seeks to transform the reduced data back to the original high-dimensional space. This “reconstructed” data is an approximation of the original but is built using fewer dimensions.
Visualization of Reconstructions:
- Original Data Histograms:
- The initial histograms show the distributions of the original features and the transit depth.
- Reconstruction Using 2 Principal Components:
- Using only the first two PCs, the data is reconstructed and its histograms visualized. This illustrates the patterns and distributions captured when only the first two components are considered.
- Reconstruction Using 3 Principal Components:
- An analogous reconstruction is done with three PCs. The addition of a third dimension might capture more nuanced variations in the data.
Significance:
- Data Compression vs. Retention:
- The visualizations enable us to compare the reconstructions against the original data. We discern how much information is retained and what is lost as we reduce dimensions.
- Guide for Dimensionality Decision:
- By juxtaposing the original with the reconstructions, we gain insights into the optimal number of PCs to use for specific tasks, striking a balance between compression and information retention.
- Empirical Understanding:
- These histograms and visual representations offer a tangible way to grasp the abstract notion of dimensionality reduction. They elucidate how PCA captures the essence of the data while diminishing dimensions.
In conclusion, this analysis, coupled with the visualizations, equips us with a robust understanding of how PCA reconstructs data using varying numbers of components. It underscores the trade-offs involved and the implications of choosing specific dimensionality levels in data-driven tasks.
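To put rough numbers on the compression-versus-retention trade-off, one can sweep the number of components and report both the cumulative explained variance and the mean squared reconstruction error in the weighted feature space. The sketch below is a minimal illustration that reuses `X_weighted` from the earlier cells and relies on scikit-learn's `inverse_transform` for the back-projection; the loop bound of 7 simply matches the number of features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Reconstruction error vs. retained variance as a function of component count
for k in range(1, 8):
    pca_k = PCA(n_components=k)
    scores = pca_k.fit_transform(X_weighted)
    X_back = pca_k.inverse_transform(scores)   # back-projection into the weighted feature space
    mse = np.mean((X_weighted - X_back) ** 2)
    retained = pca_k.explained_variance_ratio_.sum()
    print(f"k={k}: retained variance = {retained:.3f}, reconstruction MSE = {mse:.5f}")
```

Reading the two columns together makes the elbow explicit: the point where adding another component buys little extra variance (and little reduction in reconstruction error) is a natural place to stop.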
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.gridspec as gridspec
# Load images
img_shap = mpimg.imread('shapley.png')
img_rf = mpimg.imread('random_forest_importance_plot.png')
img_pme = mpimg.imread('PME.png')

# Create a grid for the subplots
fig_combined = plt.figure(figsize=(20, 12))
gs = gridspec.GridSpec(2, 2, width_ratios=[1, 1])

# Display SHAP plot
ax0 = plt.subplot(gs[0])
ax0.imshow(img_shap)
ax0.axis('off')

# Display RF Importance plot
ax1 = plt.subplot(gs[1])
ax1.imshow(img_rf)
ax1.axis('off')

# Display PME image in the middle of the 2nd row
ax2 = plt.subplot(gs[2:4])  # This makes the PME plot span both columns on the second row
ax2.imshow(img_pme)
ax2.axis('off')

plt.tight_layout()
plt.savefig('sensitivity_combined_plot.png', bbox_inches='tight', dpi=300)  # 'bbox_inches' ensures the entire plot is saved
plt.show()
This code snippet consolidates and displays three distinct plots—SHAP values, Random Forest Feature Importance, and PME Feature Importance—into a single visualization. The images are loaded and arranged in a 2x2 grid, with the SHAP and Random Forest plots on the top row, and the PME plot spanning both columns on the bottom row. After layout adjustments, the combined visualization is saved as a high-resolution PNG image titled ‘sensitivity_combined_plot.png’ and then displayed.
References
Changeat, Q., & Yip, K. H. (2023). ESA-Ariel Data Challenge NeurIPS 2022: Introduction to exo-atmospheric studies and presentation of the Atmospheric Big Challenge (ABC) Database. arXiv preprint arXiv:2206.14633.
Herin, M., Il Idrissi, M., Chabridon, V., & Iooss, B. (2022). Proportional marginal effects for global sensitivity analysis. arXiv preprint arXiv:2210.13065.